Abstract. The storage capacity of an incremental learning algorithm for the parity
machine, the Tilinglike Learning Algorithm, is analytically determined in the limit of a
large number of hidden perceptrons. Different learning rules for the simple perceptron
are investigated. The usual Gardner-Derrida one leads to a storage capacity close to
the upper bound, which is independent of the learning algorithm considered.



I INTRODUCTION 

The storage capacity is one of the most important characteristics of a neural 
network. It is the maximal number of random input-output patterns per input 
entry that a network is able to correctly classify with probability one. This quantity 
is independent of the algorithm used to learn the weights of the network; it only 
depends on its architecture. We will refer to it as the architecture storage capacity
$\alpha^{\rm arch}$ in the following, to distinguish it from the algorithm-dependent one,
hereafter called algorithm storage capacity, $\alpha^{\rm alg}$.

The simplest neural network, the perceptron, has its inputs directly connected
to the output. Geometrical arguments [1] and a statistical mechanics calculation [2]
determined that $\alpha^{\rm arch} = 2$. Several perceptron learning algorithms, like the
Adatron [3,4], are known to achieve this storage capacity.

The capacity can be increased using networks with more complicated architectures.
The next one in order of increasing complexity is the extensively studied [5]
monolayer perceptron (MLP), which has $k$ "hidden" perceptrons connected to the output
unit. As an MLP can store any function of its inputs, provided that the number of
hidden units is adequate, it is not worthwhile to consider more complex architectures
for the storage problem.

Given an input pattern, the hidden units' states define a $k$-dimensional vector,
the pattern's internal representation (IR). The network's output is a function of
the IR. In the following, we consider binary neurons, of states $\pm 1$, and focus on the
parity machine, whose output is the product of the $k$ components of the IR.

The main problem when training MLPs is that the IRs of the input patterns are
unknown. It has been proposed [6] to build the hidden layer using a constructive
procedure, called Tilinglike Learning Algorithm (TLA), in which the hidden
perceptrons are included one after the other and trained to correct the learning errors
of the preceding unit. As each unit can correct at least one error, convergence is
ensured [6]. It is straightforward to show that the TLA generates a parity machine [6],
but the number of included hidden units depends crucially on the performance of
the perceptron learning algorithm used to train them.

Geometric arguments [7] and a statistical mechanics replica calculation [8]
showed that the architecture storage capacity of the parity machine in the limit
of a large number of hidden units $k$ is
$\alpha^{\rm arch}(k) \simeq k \ln k / \ln 2 \simeq 1.44\, k \ln k$. However, it was not
clear whether this storage capacity could actually be achieved
with a learning algorithm. In this paper, we show that the Tilinglike Learning
Algorithm (TLA) can reach a storage capacity close to the architecture storage
capacity, provided that the hidden perceptrons are trained with an appropriate
learning algorithm. In section II we describe more precisely the setting and the
TLA. The analytical expression of the algorithm storage capacity in the limit of
large $k$ is determined in section III, where we show that the learning algorithm used
to train the hidden units must satisfy stringent conditions for the TLA to converge
with a finite number of hidden units. The results presented in section IV show that
these conditions rule out some perceptron learning algorithms, like the Adatron.
The conclusions are presented in section V.

II THE TILINGLIKE LEARNING ALGORITHM 

Let us assume a training set
$\mathcal{L}_\alpha = \{\mathbf{x}^\mu, \tau^\mu\}_{\mu=1,\dots,P}$ of $P = \alpha N$
input-output patterns. The inputs $\mathbf{x}^\mu$ are random Gaussian $N$-dimensional
vectors with zero mean and unit variance in each direction. The corresponding outputs
$\tau^\mu = \pm 1$ are the learning targets. Their values $\tau$ are randomly selected
with probability:

$$P(\tau;\epsilon) = \epsilon\,\delta(\tau+1) + (1-\epsilon)\,\delta(\tau-1), \qquad (1)$$

with $\epsilon = 1/2$, so that the targets are unbiased. The role of the bias $\epsilon$
introduced in (1) will become clear in the following.

The TLA constructs the parity machine by including successive perceptrons in
the hidden layer. Each unit is connected to the input $\mathbf{x}$ through weights
$\{\mathbf{J},\theta\} = (J_1,\dots,J_N,\theta)$ where $\theta$ is a threshold. The inputs
are classified through $\sigma = {\rm sign}(\mathbf{J}\cdot\mathbf{x} - \theta)$.
Thus, a perceptron separates linearly the input space with a hyperplane
orthogonal to $\mathbf{J}$ (we assume $\mathbf{J}\cdot\mathbf{J} = 1$) at a distance
$\theta$ from the origin. The weights
and the threshold are learned through the minimization of a cost function:

$$E(\{\mathbf{J},\theta\};\mathcal{L}_\alpha) = \sum_{\mu=1}^{P} V\big(\tau^\mu(\mathbf{J}\cdot\mathbf{x}^\mu - \theta)\big), \qquad (2)$$

where the potential $V(\lambda)$ is the contribution of each pattern to the cost, and
$|\lambda|$ is the distance of the pattern to the hyperplane.
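
The cost (2) can be illustrated with a minimal sketch, assuming NumPy arrays for the patterns and targets. The two potentials shown are the Gardner-Derrida and Adatron forms discussed in section IV; the function names are illustrative and are not taken from the paper.

```python
import numpy as np

def stabilities(J, theta, X, tau):
    """lambda^mu = tau^mu (J . x^mu - theta), with J rescaled to unit length."""
    J = np.asarray(J, dtype=float)
    J = J / np.linalg.norm(J)
    return tau * (X @ J - theta)

def cost(J, theta, X, tau, potential):
    """Sum of the potential V(lambda^mu) over all patterns, as in equation (2)."""
    return float(np.sum(potential(stabilities(J, theta, X, tau))))

def gardner_derrida(kappa=0.0):
    """V(lambda) = Theta(kappa - lambda): counts patterns with lambda < kappa."""
    return lambda lam: (lam < kappa).astype(float)

def adatron(kappa):
    """V(lambda) = (kappa - lambda)^2 Theta(kappa - lambda)."""
    return lambda lam: np.where(lam < kappa, (kappa - lam) ** 2, 0.0)

# Three toy patterns with stabilities 1, 0 and -1 for J = (1, 0), theta = 0.
X_toy = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
tau_toy = np.array([1.0, 1.0, 1.0])
errors = cost([1.0, 0.0], 0.0, X_toy, tau_toy, gardner_derrida(0.0))
adatron_cost = cost([1.0, 0.0], 0.0, X_toy, tau_toy, adatron(0.5))
```

With vanishing stability the Gardner-Derrida cost is simply the number of misclassified patterns, whereas the Adatron potential also penalises correctly classified patterns that lie closer than the stability to the hyperplane.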

Within the TLA heuristics, the first perceptron $i = 1$ is trained to learn the
targets $\tau_1^\mu = \tau^\mu$. After learning, its weights are
$\{\mathbf{J}_1,\theta_1\}$; its training error is
$\epsilon_t^1 = (1/P)\sum_\mu \Theta(-\sigma_1^\mu \tau_1^\mu)$, where $\Theta(x)$ is the
Heaviside function and $\sigma_1^\mu$ the perceptron's output to pattern $\mu$. If
$\epsilon_t^1 = 0$, the training set is correctly classified; the TLA
stops with only one simple perceptron. Otherwise, a new perceptron is introduced.
The successive perceptrons $i$ are trained to learn training sets
$\mathcal{L}_\alpha(i) = \{\mathbf{x}^\mu, \tau_i^\mu\}$ with
targets $\tau_i^\mu = \tau_{i-1}^\mu \sigma_{i-1}^\mu$, that is, $\tau_i^\mu = 1$ if the
pattern $\mu$ is correctly classified by the
previous perceptron and $\tau_i^\mu = -1$ otherwise. If the perceptron learning algorithm
is correctly chosen it can be shown that the successive training errors $\epsilon_t^i$
are strictly decreasing [6,9]. Thus, the TLA procedure necessarily converges to a MLP with
$k$ units, where the $k$-th perceptron is the first one to meet the condition
$\epsilon_t^k = 0$. Then, the product $\sigma^\mu = \sigma_1^\mu \cdots \sigma_k^\mu$
gives the correct output to the patterns of the training set $\mathcal{L}_\alpha$.
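
The construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the unit trainer `train_unit` is a hypothetical pocket perceptron (not one of the cost functions studied in this paper), and the inputs are rescaled to unit norm so that a fallback unit isolating a single misclassified pattern always exists, which guarantees strictly decreasing training errors.

```python
import numpy as np

def outputs(J, theta, X):
    """sigma = sign(J . x - theta), with ties broken towards +1."""
    return np.where(X @ J - theta >= 0, 1, -1)

def train_unit(X, targets, sweeps=100):
    """Hypothetical trainer: pocket perceptron with a one-pattern fallback."""
    P, N = X.shape
    n_minus = int(np.sum(targets == -1))
    J, theta = np.zeros(N), -1.0              # trivial unit: sigma = +1 everywhere
    best, best_err = (J.copy(), theta), n_minus
    for _ in range(sweeps):
        if best_err == 0:
            break
        for mu in range(P):
            if outputs(J, theta, X[mu:mu + 1])[0] != targets[mu]:
                J = J + targets[mu] * X[mu]   # perceptron update of the weights
                theta = theta - targets[mu]   # and of the threshold
                err = int(np.sum(outputs(J, theta, X) != targets))
                if err < best_err:
                    best, best_err = (J.copy(), theta), err
    if best_err < n_minus or n_minus == 0:
        return best
    # Fallback (assumes distinct unit-norm inputs): flip exactly one pattern.
    nu = int(np.flatnonzero(targets == -1)[0])
    rest = np.delete(X @ X[nu], nu)
    gap = (1.0 + rest.max()) / 2.0 if rest.size else 0.0
    return -X[nu], -gap

def tla(X, tau, max_units=1000):
    """Add units until one of them learns its targets without error."""
    units, targets = [], tau.copy()
    while len(units) < max_units:
        J, theta = train_unit(X, targets)
        sigma = outputs(J, theta, X)
        units.append((J, theta))
        if np.all(sigma == targets):
            break
        targets = targets * sigma             # +1 if correctly classified, else -1
    return units

def parity_output(units, X):
    """Product of the hidden units' outputs."""
    out = np.ones(len(X), dtype=int)
    for J, theta in units:
        out = out * outputs(J, theta, X)
    return out

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 10))
X = X / np.linalg.norm(X, axis=1, keepdims=True)   # assumption: unit-norm inputs
tau = rng.choice(np.array([-1, 1]), size=40)
units = tla(X, tau)
```

The number of units returned depends entirely on how well the unit trainer performs, which is the point developed in the following sections.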

III STORAGE CAPACITY 

The algorithm storage capacity of the TLA, $\alpha^{\rm alg}(k)$, is simply the inverse
function of $k(\alpha)$, the number of perceptrons typically included by the TLA when
the training set has a size $\alpha = P/N$. In order to determine $k(\alpha)$, consider
the $i$-th hidden unit: the probability of its targets depends on the training error
$\epsilon_t^{i-1}$ of the previous perceptron. Although there exist some correlations
between the targets $\tau_i^\mu$, due to the correlations in the weights of the successive
perceptrons, in the limit of large training sets ($\alpha \to \infty$) they may be
neglected [10]. Thus, we may assume that the targets are independently drawn with
probability (1), with a bias $\epsilon_t^{i-1}$. Then, the successive training errors
satisfy a simple recursive relation [10-12]:

$$\epsilon_t^{i} = \epsilon_t(\alpha, \epsilon_t^{i-1}), \qquad (3)$$

where $\epsilon_t(\alpha,\epsilon_t^{i-1})$ is the training error of a simple perceptron
trained with a training set of size $\alpha$ and biased targets drawn with a probability
$P(\tau;\epsilon_t^{i-1})$ given by (1). The number $k$ of perceptrons necessary to
correctly classify the initial training set satisfies [11,12]:

$$\bigcirc^{k} f_\alpha(1/2) \equiv \underbrace{f_\alpha \circ \cdots \circ f_\alpha}_{k\ \text{times}}(1/2) = 0, \qquad (4)$$

where $f_\alpha(\epsilon)$ stands for $\epsilon_t(\alpha,\epsilon)$ and the symbol $\circ$
for the composition of functions. In the limit of large $\alpha$, the training error
$\epsilon_t(\alpha,\epsilon)$ is close to $\epsilon$ and the number of
simple perceptrons $k(\alpha)$ is large. It is thus possible to use the continuum limit:

$$\frac{1}{k}\frac{d\epsilon}{dx} = \epsilon_t(\alpha,\epsilon) - \epsilon, \qquad (5)$$

with $x = i/k$. After integration of this differential equation we obtain [11,12] the
typical number of hidden units introduced by the TLA:

$$k(\alpha) = \int_0^{1/2} \frac{d\epsilon}{\epsilon - \epsilon_t(\alpha,\epsilon)}. \qquad (6)$$

$k(\alpha)$ depends on the specific cost function (2) used to train the perceptrons through
$\epsilon_t(\alpha,\epsilon)$, which is the training error of a simple perceptron learning
a training set with biased targets.

FIGURE 1. Successive training errors $\epsilon_t^i$. The full curve corresponds to
$\epsilon_t(\alpha,\epsilon)$. In this case six perceptrons are necessary for the
convergence of the Tilinglike Learning Algorithm.
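
The relation between the discrete recursion and its continuum limit can be checked numerically. The sketch below assumes that the number of units is given by the integral of $1/(\epsilon - \epsilon_t)$ over $0 < \epsilon < 1/2$, and uses a purely hypothetical error reduction `gain(eps) = c * sqrt(eps)` in place of a real perceptron training error; it counts the units obtained by iterating the map from $\epsilon = 1/2$ and compares this with the integral.

```python
import math

C_TOY = 0.01  # hypothetical strength of the error reduction per unit

def gain(eps):
    """Toy choice for eps - eps_t(alpha, eps); not a result of the paper."""
    return C_TOY * math.sqrt(eps)

def toy_map(eps):
    """Toy training error of the next unit, clipped at zero."""
    return max(0.0, eps - gain(eps))

def units_by_recursion(f, eps0=0.5, max_units=10**6):
    """Number of applications of f needed to bring eps0 down to zero."""
    eps, k = eps0, 0
    while eps > 0.0 and k < max_units:
        eps = f(eps)
        k += 1
    return k

def units_by_integral(g, eps0=0.5, n=200000):
    """Midpoint rule for the integral of 1 / g(eps) between 0 and eps0."""
    h = eps0 / n
    return sum(h / g((j + 0.5) * h) for j in range(n))

k_rec = units_by_recursion(toy_map)
k_int = units_by_integral(gain)
```

For this toy choice the integral can be done by hand and equals $\sqrt{2}/c$, so the two counts should agree up to a few units when $c$ is small, which is the regime in which the continuum limit is meant to apply.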



IV RESULTS FOR DIFFERENT LEARNING POTENTIALS

In this section we determine the perceptron's training error $\epsilon_t(\alpha,\epsilon)$
for biased training sets, in the thermodynamic limit ($N \to \infty$ with
$\alpha = P/N$ fixed). For different possible choices of the potential $V$ in (2), we
deduce the number of hidden units $k(\alpha; V)$ and the algorithm storage capacity of
the TLA. Only the main results are presented here; the interested reader can find the
details in [11,12].

1 Adatron cost function

The Adatron potential is $V(\lambda) = (\kappa-\lambda)^2\,\Theta(\kappa-\lambda)$, where
$\kappa$ is a positive parameter called stability. All the patterns with negative
$\lambda$, and those with $\lambda > 0$ that lie closer than $\kappa$ to the hyperplane
(although they are correctly classified), contribute to the cost. The training error
$\epsilon_t(\alpha,\epsilon)$ is obtained through a replica calculation assuming replica
symmetry (RS), which can be shown to hold for all $\alpha$. It turns
out that for fixed $\kappa$, in the limit of large $\alpha$,
$\epsilon_t(\alpha,\epsilon) > \epsilon$. As a consequence, the
constraint that the successive training errors are strictly decreasing, necessary for
the convergence of the TLA, is not satisfied. This problem can be circumvented at
the price of considering $\kappa$ a free parameter, and minimizing the training error
with respect to it. In that case, $\epsilon_t(\alpha,\epsilon) < \epsilon$ and the
algorithm storage capacity is $\alpha^{\rm alg}(k,{\rm Adatron}) \simeq 4.55 \ln k$ in the
limit of large $\alpha$.



2 Gardner-Derrida cost function

The potential of the Gardner-Derrida (GD) cost function [13] is
$V(\lambda) = \Theta(\kappa-\lambda)$.
The hypothesis of RS is incorrect for this potential, and the obtained value of
$\epsilon_t$ is a lower bound to the true training error. Consequently, the replica
calculation only allows us to determine an upper bound to $\alpha^{\rm alg}(k)$. If
$\kappa = 0$, the cost is nothing else but the number of misclassified patterns. It gives
the lowest bound to $\epsilon_t$. In the limit of large training set size $\alpha$, we
obtain:

$$\epsilon_t(\alpha,\epsilon) \simeq \epsilon - \frac{2}{a^2(\epsilon)}\,\frac{\ln\alpha}{\alpha}, \qquad (7)$$

where $a(\epsilon)$ satisfies
$\epsilon = [e^{a}(1-a) - 1]/[2(\cosh a - a\sinh a - 1)]$. This leads to
$k(\alpha,{\rm GD}) \simeq 0.475\,\alpha/\ln\alpha$ and
$\alpha^{\rm alg}(k,{\rm GD}) \simeq 2.11\,k\ln k$, larger than $\alpha^{\rm arch}(k)$,
probably due to the failure of the RS hypothesis.
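
The prefactors quoted above can be cross-checked numerically. The sketch below assumes that the number of units is the integral of $1/(\epsilon - \epsilon_t)$ over $0 < \epsilon < 1/2$ and that $\epsilon - \epsilon_t \simeq (2/a^2(\epsilon)) \ln\alpha/\alpha$, so that $k \simeq C\,\alpha/\ln\alpha$ with $C = \frac{1}{2}\int_0^{1/2} a^2(\epsilon)\,d\epsilon$. Integrating by parts with $s = -a \ge 0$ gives $C = \int_0^\infty s\,\epsilon(-s)\,ds$, which is what the code evaluates. The function names and the cutoff `s_max` are illustrative choices.

```python
import math

def eps_of_a(a):
    """epsilon as a function of a, from the implicit relation quoted in the text."""
    if abs(a) < 1e-4:
        return 0.5 + a / 3.0              # small-a expansion, avoids 0/0
    num = math.exp(a) * (1.0 - a) - 1.0
    den = 2.0 * (math.cosh(a) - a * math.sinh(a) - 1.0)
    return num / den

def prefactor(s_max=40.0, n=4000):
    """Simpson rule for C = integral of s * eps(-s) from 0 to s_max (n even)."""
    h = s_max / n
    g = lambda s: s * eps_of_a(-s)
    total = g(0.0) + g(s_max)
    for j in range(1, n):
        total += (4 if j % 2 else 2) * g(j * h)
    return total * h / 3.0

C = prefactor()
print(C, 1.0 / C)   # to be compared with the prefactors of k(alpha) and alpha^alg(k)
```

To leading order, inverting $k \simeq C\,\alpha/\ln\alpha$ gives $\alpha \simeq (1/C)\,k\ln k$, so the two printed numbers play the roles of the prefactors of $k(\alpha,{\rm GD})$ and $\alpha^{\rm alg}(k,{\rm GD})$ respectively.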

In order to obtain a lower bound to $\alpha^{\rm alg}(k)$ we used the Kuhn-Tucker cavity
method [14,11,12], which gives an upper bound to $\epsilon_t$. As a result of both
calculations, we can bound $\alpha^{\rm alg}(k,{\rm GD})$:

$$0.924\,k < \alpha^{\rm alg}(k,{\rm GD}) < 2.11\,k\ln k. \qquad (8)$$

In view of this result, we expect that the algorithm storage capacity of the TLA
behaves like $k(\ln k)^{\nu}$ with $0 < \nu < 1$. A calculation with one step of replica
symmetry breaking would give an estimate of the exponent $\nu$. Since the RS solution
gives a better approximation of the training error than the Kuhn-Tucker cavity
method, we expect the exponent $\nu$ to be close to 1, leading to an algorithm storage
capacity close to the architecture's capacity. This result shows that the TLA may
build a nearly optimal network.

In order to improve the robustness against noise in the data, it is usually useful
to impose some finite stability $\kappa$ to the patterns. The corresponding GD potential
is $V(\lambda) = \Theta(\kappa - \lambda)$. As with $\kappa = 0$, here also the RS solution
is unstable. The bounds on $\alpha^{\rm alg}(k,\kappa,{\rm GD})$ deduced from the results
for $\epsilon_t$ obtained with the RS
hypothesis and the Kuhn-Tucker cavity method [11,12], give:

$$\ldots < \alpha^{\rm alg}(k,\kappa,{\rm GD}) < \ldots \qquad (9)$$



Strikingly, imposing a finite stability $\kappa$ has an important effect on the algorithm
storage capacity, which in this case behaves as $k(\ln k)^{\nu}$ with $-1/2 < \nu < 0$. The
prefactors of the bounds of $\alpha^{\rm alg}(k,\kappa,{\rm GD})$ are $\kappa$-dependent
and they both diverge for $\kappa \to 0$. The exponent $\nu$ is independent of $\kappa$
for finite $\kappa$ but differs from the one corresponding to $\kappa = 0$.



V CONCLUSION 



We determined analytically the storage capacity of the Tilinglike Learning
Algorithm for the parity machine, a constructive procedure generating a monolayer
perceptron of binary hidden units. A training set of input-output examples is used
to determine the number of hidden units, which are introduced one after the other.
These are simple perceptrons that have to learn their weights using increasingly
biased target distributions.

We have shown that the storage capacity of the TLA depends crucially on the
learning errors of the successively introduced perceptrons. The properties of the
algorithm used to train the latter thus have dramatic consequences on the size of
the hidden layer generated by the TLA, and may even hinder the convergence to a
finite size network. This happens, in particular, if the perceptrons are trained with the
Adatron algorithm unless the stability is adapted to the successive targets' biases.

The smallest network is obtained using the Gardner-Derrida cost function with
vanishing stability, which corresponds to minimizing the training error. Based
on the results obtained within the replica symmetry hypothesis, and those obtained
using the Kuhn-Tucker cavity method, we expect a supra-linear storage capacity
$\alpha^{\rm alg}(k,{\rm GD}) \sim k(\ln k)^{\nu}$ with $\nu > 0$, very close to the
theoretical capacity corresponding to the architecture considered.